#### Arquitectura de Computadores

#### Capítulo 2. Procesadores segmentados

Based on the original material of the book: D.A. Patterson y J.L. Hennessy "Computer Organization and Design: The Hardware/Software Interface" 4<sup>th</sup> edition.

Escuela Politécnica Superior Universidad Autónoma de Madrid

Profesores: G130 y G131: Iván González Martínez G136: Francisco Javier Gómez Arribas

### Agenda

The Processor: A Basic MIPS Implementation

- Building a Datapath
- Designing the Control Unit (single cycle)
- An Overview of Pipelining
  - Pipeline performance
  - MIPS five stages pipeline
  - Hazards: Structure, Data and Control
- MIPS Pipelined Datapath and Control
  - Data Hazards: Forwarding vs Stalling
  - Control Hazards: Branch prediction

### Introduction

- CPU performance factors
  - Instruction count
    - Determined by ISA and compiler
  - CPI and Cycle time
    - Determined by CPU hardware
- We will examine two MIPS implementations
  - A simplified version
  - A more realistic pipelined version
- Simple subset, shows most aspects
  - Memory reference: 1w, sw
  - Arithmetic/logical: add, sub, and, or, slt
  - Control transfer: beq, j

### **Introduction (2)**

We will study simple RISC processor called MIPS (Microprocessor without Interlocked Pipeline Stages)

- 32 bits processor (data, memory)
- 32 general purpose registers
- Separated data and code memory (Harvard architecture)

### **CPU Overview**

#### Instruction Execution

- PC → instruction memory, fetch instruction
- Register numbers → register file, read registers
- Depending on instruction class
  - Use ALU to calculate
    - Arithmetic result
    - Memory address for load/store
    - Branch target address
  - Access data memory for load/store
  - PC ← target address or PC + 4



### **Datapath & control design**

Datapath: Elements that process data and addresses in the CPU

Registers, ALUs, mux's, memories, …

We will build a MIPS datapath incrementally

## Control Unit: Information comes from the 32 bits of the instruction and the control lines select:

- Registers to be read (always read two).
- The operation to be performed by ALU
- If data memory is to be read or written
- What is written and where in the register file
- What goes in PC
- Combinational Single Cycle implementation

### **Full Datapath**



### **The Main Control Unit**

Control signals derived from instruction



### **ALU Control**

### ALU used for

- Load/Store: F = add
- Branch: F = subtract
- R-type: F depends on funct field

| ALU control | Function         |  |  |
|-------------|------------------|--|--|
| 0000        | AND              |  |  |
| 0001        | OR               |  |  |
| 0010        | add              |  |  |
| 0110        | subtract         |  |  |
| 0111        | set-on-less-than |  |  |
| 1100        | NOR              |  |  |

### **ALU Control**

# Assume 2-bit ALUOp derived from opcode Combinational logic derives ALU control

| opcode | ALUOp | Operation        | funct      | ALU function     | ALU control |
|--------|-------|------------------|------------|------------------|-------------|
| lw     | 00    | load word        | XXXXXX add |                  | 0010        |
| SW     | 00    | store word       | XXXXXX     | add              | 0010        |
| beq    | 01    | branch equal     | XXXXXX     | subtract         | 0110        |
| R-type | 10    | add              | 100000     | add              | 0010        |
|        |       | subtract         | 100010     | subtract         | 0110        |
|        |       | AND              | 100100     | AND              | 0000        |
|        |       | OR               | 100101     | OR               | 0001        |
|        |       | set-on-less-than | 101010     | set-on-less-than | 0111        |

### **Datapath With Control**



The Processor — 11

### **R-Type Instruction**



### **Load Instruction**



### **Branch-on-Equal Instruction**





### **Datapath With Jumps Added**



### **Performance Issues**

- Longest delay determines clock period
  - Critical path: load instruction
  - Instruction memory → register file → ALU → data memory → register file
- Not feasible to vary period for different instructions
- Violates design principle
  - Making the common case fast
- We will improve performance by pipelining

## Agenda

- A Basic MIPS Implementation
  - Building a Datapath
  - Designing the Control Unit (single cycle)
- An Overview of Pipelining
  - Pipeline performance
  - MIPS five stages pipeline
  - Hazards: Structure, Data and Control
- MIPS Pipelined Datapath and Control
  - Data Hazards: Forwarding vs Stalling
  - Control Hazards: Branch prediction

### **Pipelining Analogy**

# Pipelined laundry: overlapping executionParallelism improves performance



Four loads:

- Speedup
  - = 8/3.5 = 2.3
- Non-stop:
  - Speedup
    - = 2n/0.5n + 1.5 ≈ 4
    - = number of stages

### **MIPS** Pipeline

- Five stages, one step per stage
- 1. IF: Instruction fetch from memory
- 2. ID: Instruction decode & register read
- 3. EX: Execute operation or calculate address
- 4. MEM: Access memory operand
- 5. WB: Write result back to register

### **Pipeline Performance**

- Assume time for stages is
  - 100ps for register read or write
  - 200ps for other stages
- Compare pipelined datapath with single-cycle datapath

| Instr    | Instr fetch | Register<br>read | ALU op | Memory<br>access | Register<br>write | Total time |
|----------|-------------|------------------|--------|------------------|-------------------|------------|
| lw       | 200ps       | 100 ps           | 200ps  | 200ps            | 100 ps            | 800ps      |
| SW       | 200ps       | 100 ps           | 200ps  | 200ps            |                   | 700ps      |
| R-format | 200ps       | 100 ps           | 200ps  |                  | 100 ps            | 600ps      |
| beq      | 200ps       | 100 ps           | 200ps  |                  |                   | 500ps      |

### **Pipeline Performance**



### **Pipeline Speedup**

- If all stages are balanced
  - i.e., all take the same time
  - Time between instructions<sub>pipelined</sub>
    - = Time between instructions<sub>nonpipelined</sub> Number of stages
- If not balanced, speedup is less
- Speedup due to increased throughput
  - Latency (time for each instruction) does not decrease

### **Pipelining and ISA Design**

- MIPS ISA designed for pipelining
  - All instructions are 32-bits
    - Easier to fetch and decode in one cycle
    - c.f. x86: 1- to 17-byte instructions
  - Few and regular instruction formats
    - Can decode and read registers in one step
  - Load/store addressing
    - Can calculate address in 3<sup>rd</sup> stage, access memory in 4<sup>th</sup> stage
  - Alignment of memory operands
    - Memory access takes only one cycle

### Hazards

- Situations that prevent starting the next instruction in the next cycle
- Structure hazards
  - A required resource is busy
- Data hazard
  - Need to wait for previous instruction to complete its data read/write
- Control hazard
  - Deciding on control action depends on previous instruction

### **Structure Hazards**

- Conflict for use of a resource In MIPS pipeline with a single memory
  - Load/store requires data access
  - Instruction fetch would have to stall for that cycle
    - Would cause a pipeline "bubble"
- Hence, pipelined datapaths require separate instruction/data memories
  - Or separate instruction/data caches

### **Data Hazards**

- An instruction depends on completion of data access by a previous instruction
  - add \$s0, \$t0, \$t1
    sub \$t2, \$s0, \$t3



## Forwarding (aka Bypassing)

Use result when it is computed

- Don't wait for it to be stored in a register
- Requires extra connections in the datapath



### Load-Use Data Hazard

Can't always avoid stalls by forwardingIf value not computed when needed

Can't forward backward in time!



### **Code Scheduling to Avoid Stalls**

- Reorder code to avoid use of load result in the next instruction
- C code for A = B + E; C = B + F;



### **Control Hazards**

- Branch determines flow of control
  - Fetching next instruction depends on branch outcome
  - Pipeline can't always fetch correct instruction
     Still working on ID stage of branch
- In MIPS pipeline
  - Need to compare registers and compute target early in the pipeline
  - Add hardware to do it in ID stage

### **Stall on Branch**

## Wait until branch outcome determined before fetching next instruction



### **Branch Prediction**

- Longer pipelines can't readily determine branch outcome early
  - Stall penalty becomes unacceptable
- Predict outcome of branch
  - Only stall if prediction is wrong
- In MIPS pipeline
  - Can predict branches not taken
  - Fetch instruction after branch, with no delay

### **MIPS with Predict Not Taken**



The Processor — 34

### **More-Realistic Branch Prediction**

- Static branch prediction
  - Based on typical branch behavior
  - Example: loop and if-statement branches
    - Predict backward branches taken
    - Predict forward branches not taken
- Dynamic branch prediction
  - Hardware measures actual branch behavior
    - e.g., record recent history of each branch
  - Assume future behavior will continue the trend
    - When wrong, stall while re-fetching, and update history

### **Pipeline Summary**

#### **The BIG Picture**

- Pipelining improves performance by increasing instruction throughput
  - Executes multiple instructions in parallel
  - Each instruction has the same latency
- Subject to hazards
  - Structure, data, control
- Instruction set design affects complexity of pipeline implementation

# Agenda

#### A Basic MIPS Implementation

- Building a Datapath
- Designing the Control Unit (single cycle)
- An Overview of Pipelining
  - Pipeline performance
  - MIPS five stages pipeline
  - Hazards: Structure, Data and Control
- MIPS Pipelined Datapath and Control
  - Data Hazards: Forwarding vs Stalling
  - Control Hazards: Branch prediction

# **MIPS Pipelined Datapath**



# **Pipeline registers**

Need registers between stages
 To hold information produced in previous cycle



# **Pipeline Operation**

Cycle-by-cycle flow of instructions through the pipelined datapath

- "Single-clock-cycle" pipeline diagram
  - Shows pipeline usage in a single cycle
  - Highlight resources used
- c.f. "multi-clock-cycle" diagram
  - Graph of operation over time
- We'll look at "single-clock-cycle" diagrams for load & store

## IF for Load, Store, ...



## ID for Load, Store, ...



### **EX for Load**



#### **MEM for Load**



### **WB for Load**



# **Corrected Datapath for Load**



### **EX for Store**



#### **MEM for Store**







sw Write-back

# Multi-Cycle Pipeline Diagram

#### Form showing resource usage



# **Multi-Cycle Pipeline Diagram**

#### Traditional form

add \$12, \$3, \$4

lw \$13, 24(\$1)

add \$14, \$5, \$6

|        |                                                | Time (in clock cycles) |                      |                    |                |                |            |      |      |      |
|--------|------------------------------------------------|------------------------|----------------------|--------------------|----------------|----------------|------------|------|------|------|
|        |                                                | CC 1                   | CC 2                 | CC 3               | CC 4           | CC 5           | CC 6       | CC 7 | CC 8 | CC 9 |
| e<br>o | Program<br>xecution<br>rder<br>n instructions) |                        |                      |                    |                |                |            |      |      |      |
|        | lw \$10, 20(\$1)                               | Instruction<br>fetch   | Instruction decode   | Execution          | Data<br>access | Write back     |            |      |      |      |
|        | sub \$11, \$2, \$3                             |                        | Instruction<br>fetch | Instruction decode | Execution      | Data<br>access | Write back |      |      |      |

Instruction

decode

Instruction

fetch

Data

access

Execution

Instruction

decode

Execution

Instruction

decode

Instruction

fetch

Write back

Data

access

Execution

Write back

Data

access

Write back

Instruction

fetch

# Single-Cycle Pipeline Diagram

#### State of pipeline in a given cycle



# **Pipelined Control (Simplified)**



# **Pipelined Control**

Control signals derived from instructionAs in single-cycle implementation



# **Pipelined Control**



#### **Data Hazards in ALU Instructions**

#### Consider this sequence:

- sub \$2, \$1,\$3
  and \$12,\$2,\$5
  or \$13,\$6,\$2
  add \$14,\$2,\$2
  sw \$15,100(\$2)
- We can resolve hazards with forwarding
  How do we detect when to forward?

# **Dependencies & Forwarding**



## **Detecting the Need to Forward**

Pass register numbers along pipeline

- e.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register
- ALU operand register numbers in EX stage are given by
  - ID/EX.RegisterRs, ID/EX.RegisterRt
  - Data hazards when
    - 1a. EX/MEM.RegisterRd = ID/EX.RegisterRs
    - 1b. EX/MEM.RegisterRd = ID/EX.RegisterRt
    - 2a. MEM/WB.RegisterRd = ID/EX.RegisterRs
    - 2b. MEM/WB.RegisterRd = ID/EX.RegisterRt



MEM/WB

pipeline rea

### **Detecting the Need to Forward**

- But only if forwarding instruction will write to a register!
  - EX/MEM.RegWrite, MEM/WB.RegWrite
- And only if Rd for that instruction is not \$zero
  - EX/MEM.RegisterRd ≠ 0, MEM/WB.RegisterRd ≠ 0

# **Forwarding Paths & Conditions**



b. With forwarding

#### EX hazard

- \* if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) ForwardA = 10
- \* if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) ForwardB = 10

#### MEM hazard

- \* if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRs)) ForwardA = 01
- \* if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01

# **Double Data Hazard**

#### Consider the sequence:

add \$1,\$1,\$2 add \$1,\$1,\$3 add \$1,\$1,\$4

- Both hazards occur
  - Want to use the most recent
- Revise MEM hazard condition
  - Only fwd if EX hazard condition isn't true

### **Revised Forwarding Condition**

- MEM hazard
  - if (MEM/WB.RegWrite and (MEM/WB.RegisterRd  $\neq$  0)

and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRs)) and (MEM/WB.RegisterRd = ID/EX.RegisterRs))

ForwardA = 01

 if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0) and not (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0) and (EX/MEM.RegisterRd = ID/EX.RegisterRt)) and (MEM/WB.RegisterRd = ID/EX.RegisterRt)) ForwardB = 01

# **Datapath with Forwarding**



#### **Load-Use Data Hazard**



# **Load-Use Hazard Detection**

- Check when using instruction is decoded in ID stage
- ALU operand register numbers in ID stage are given by
  - IF/ID.RegisterRs, IF/ID.RegisterRt
- Load-use hazard when
  - ID/EX.MemRead and ((ID/EX.RegisterRt = IF/ID.RegisterRs) or (ID/EX.RegisterRt = IF/ID.RegisterRt))
- If detected, stall and insert bubble

# How to Stall the Pipeline

- Force control values in ID/EX register to 0
  - EX, MEM and WB do nop (no-operation)
- Prevent update of PC and IF/ID register
  - Using instruction is decoded again
  - Following instruction is fetched again
  - I-cycle stall allows MEM to read data for Tw
    - Can subsequently forward to EX stage

# **Stall/Bubble in the Pipeline**



# **Stall/Bubble in the Pipeline**



#### **Datapath with Hazard Detection**



# **Stalls and Performance**

#### **The BIG Picture**

Stalls reduce performance
But are required to get correct results
Compiler can arrange code to avoid hazards and stalls

Requires knowledge of the pipeline structure

# **Branch Hazards**

#### If branch outcome determined in MEM



# **Reducing Branch Delay**

- Move hardware to determine outcome to ID stage
  - Target address adder
  - Register comparator
- Example: branch taken

| 36: | sub | \$10,  | \$4,   | \$8 |
|-----|-----|--------|--------|-----|
| 40: | beq | \$1,   | \$3,   | 7   |
| 44: | and | \$12,  | \$2,   | \$5 |
| 48: | or  | \$13,  | \$2,   | \$6 |
| 52: | add | \$14,  | \$4,   | \$2 |
| 56: | slt | \$15,  | \$6,   | \$7 |
|     |     |        |        |     |
| 72: | ٦w  | \$4, ! | 50(\$7 | 7)  |

## **Example: Branch Taken**



## **Example: Branch Taken**



## **Data Hazards for Branches**

If a comparison register is a destination of 2<sup>nd</sup> or 3<sup>rd</sup> preceding ALU instruction



#### Can resolve using forwarding

## **Data Hazards for Branches**

- If a comparison register is a destination of preceding ALU instruction or 2<sup>nd</sup> preceding load instruction
  - Need 1 stall cycle



## **Data Hazards for Branches**

If a comparison register is a destination of immediately preceding load instruction Need 2 stall cycles



## **Branch Hazard and Prediction**

• Static prediction: always predict the same: Speculative running until condition is solved. If error, remove speculative results:

- Effective Prediction (E): branch occurs.
- No Effective prediction (NE): branch does NOT occur.
- Prediction NE if the branch is forward and E if it is back.
- Dynamic Prediction: change the prediction according the branch history. Use a small memory for each branch address (BHT, *Branch History Table*)



## **Dynamic Branch Prediction**

- In deeper and superscalar pipelines, branch penalty is more significant
- Use dynamic prediction
  - Branch prediction buffer (aka branch history table)
  - Indexed by recent branch instruction addresses
  - Stores outcome (taken/not taken)
  - To execute a branch
    - Check table, expect the same outcome
    - Start fetching from fall-through or target
    - If wrong, flush pipeline and flip prediction

# **Calculating the Branch Target**

- Even with predictor, still need to calculate the target address
  - 1-cycle penalty for a taken branch
- Branch target buffer
  - Cache of target addresses
  - Indexed by PC when instruction fetched
    - If hit and instruction is branch predicted taken, can fetch target immediately

## **Branch Target Buffer (BTB)**



#### **Fallacies**

- Pipelining is easy (!)
  - The basic idea is easy
  - The devil is in the details
    - e.g., detecting data hazards
- Pipelining is independent of technology
  - So why haven't we always done pipelining?
  - More transistors make more advanced techniques feasible
  - Pipeline-related ISA design needs to take account of technology trends
    - e.g., predicated instructions

### **Pitfalls**

#### Poor ISA design can make pipelining harder

- e.g., complex instruction sets (VAX, IA-32)
  - Significant overhead to make pipelining work
  - IA-32 micro-op approach
- e.g., complex addressing modes
  - Register update side effects, memory indirection
- e.g., delayed branches
  - Advanced pipelines have long delay slots

# **Concluding Remarks**

- ISA influences design of datapath and control
- Datapath and control influence design of ISA
- Pipelining improves instruction throughput using parallelism
  - More instructions completed per second
  - Latency for each instruction not reduced
  - Hazards: structural, data, control

## Información Adicional

Información adicional para los problemas del capítulo 2

#### Tipos de riesgos por dependencia de datos

Dependencias que se presentan para 2 instrucciones i y j, con i ejecutándose antes que j.
 RAW (Read After Write): la instrucción posterior j intenta leer una fuente antes de que la instrucción anterior i la haya modificado.
 WAR (Write After Read): la instrucción j intenta modificar un destino antes de que la instrucción i lo haya leído como fuente.
 WAW (Write After Write): la instrucción j intenta modificar un destino antes de que la instrucción i lo haya hecho (se modificar un destino antes de que la instrucción i lo haya hecho (se modifica el orden normal de escritura).

| ✓ Ejemplos: <u>RAW</u>                                                                                                          | WAR                                                        | WAW                                             |
|---------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------|-------------------------------------------------|
| ADD <u>r1</u> , r2, r3<br>SUB r5, <u>r1</u> , r6<br>AND r6, r5, <u>r1</u><br>ADD r4, <u>r1</u> , r3<br>SW r10, 100( <u>r1</u> ) | ADD r1, r2, <u>r3</u><br>OR <u>r3</u> ,r4, r5              | DIV <u>r1</u> , r2, r3<br>AND <u>r1</u> ,r4, r5 |
|                                                                                                                                 | En procesadores segmentados<br>con ejecución en orden SÓLO |                                                 |

hay que gestionar los RAW